FEAST: a Fast and Accurate Microbial Source-Tracking Tool (fully working version)
Xu Rui, Guangdong Academy of Sciences (environmental microbiology)
I recently wanted to try source tracking for a project for my PI, and a WeChat public account happened to push a guide on it. After testing, however, I found that both the author's original GitHub version and the "文涛聊科研" version had bugs that kept the scripts from running. So I worked through the whole pipeline myself and cleaned it up; I hope this helps beginners get started quickly.
I. Prior work (with thanks; feel free to compare against these)
1. Original GitHub repository:
https://github.com/cozygene/FEAST (click the green Download button on the right to download the whole repository)
II. Bugs found in testing
1. The original script does not list the packages it depends on, which is unfriendly to beginners.
2. The expected file formats are inconsistent (txt vs. csv). With your own data it is very easy to hit an error at Line 68 of the script, the coverage calculation
(# COVERAGE = min(rowSums(otus[c(train.ix, test.ix),]))),
which is usually caused by the table header being read incorrectly earlier on.
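To make bug 2 concrete: if the first column of sample IDs is not declared as row names when reading the csv, the whole matrix becomes character-typed and rowSums() fails exactly as at Line 68. A minimal sketch (the toy table and file are hypothetical):

```r
# Toy OTU table whose first column holds the sample IDs
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(SampleID = c("S1", "S2"), OTU1 = c(5, 3), OTU2 = c(0, 7)),
          tmp, row.names = FALSE)

# Wrong: the ID column is kept as data, so the matrix is character
bad <- as.matrix(read.csv(tmp, header = TRUE))
is.numeric(bad)            # FALSE -> rowSums(bad) would fail

# Right: declare column 1 as the row names, keeping the matrix numeric
good <- as.matrix(read.csv(tmp, header = TRUE, row.names = 1))
rowSums(good)
```

Declaring row.names = 1, as the revised script does, is what keeps the count matrix numeric.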
3. In the original test data, one element of the results has the wrong length (the others have 4 entries, one has only 3), so the later as.data.frame() call on the results fails.
Figure: the bug in the original test data
Figure: the as.data.frame() error when converting the original results
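Bug 3 can be reproduced without the original data: a list of per-sink proportion vectors with unequal lengths cannot be converted by as.data.frame(). A minimal sketch (toy numbers and source names are made up); the NA-padding fix shown here is the same idea the revised script uses on Proportions_est:

```r
# One sink returned one fewer proportion than the other
props <- list(repeat_1 = c(soil = 0.5, water = 0.3, unknown = 0.2),
              repeat_2 = c(soil = 0.6, unknown = 0.4))

# as.data.frame() fails: the elements imply differing numbers of rows
res <- try(as.data.frame(props), silent = TRUE)
inherits(res, "try-error")   # TRUE

# Pad the short vector(s) with NA before converting
n <- max(lengths(props))
props <- lapply(props, function(x) { length(x) <- n; x })
as.data.frame(props)         # now works; the missing entry is NA
```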
4. The 文涛 version provides a nice results-visualization script, but unfortunately ships no test data.
III. How to use the revised scripts
1. Copy the revised script folder (which also contains the original version for reference) into your working directory.
2. Potential origins of the community are labeled "Source"; the target community to be decomposed is labeled "Sink".
3. FEAST ships two scripts: use "FEAST_example_one_sink.R" for a single sink and "FEAST_example_Multiple_sinks.R" for multiple sinks. Both depend on the main program "FEAST_src/src.R".
4. The multiple-sink script additionally requires the parameter "different_sources_flag" (= 1 when each sink has its own sources, = 0 when all sinks share the same sources); see the decision flow charts below:
Figure: decision flow for a single sink
Figure: decision flow for multiple sinks
5. The revised scripts consistently use the csv format.
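To make the flag choice in step 4 concrete, here are two toy meta_data layouts (the column names Env, SourceSink, and id follow the script; the sample and environment names are made up):

```r
# different_sources_flag = 0: all sinks share one source pool,
# so the id column only numbers the sinks (the script sets source ids to NA)
meta_shared <- data.frame(
  row.names  = c("soil1", "water1", "sinkA", "sinkB"),
  Env        = c("soil", "water", "gutA", "gutB"),
  SourceSink = c("Source", "Source", "Sink", "Sink"),
  id         = c(NA, NA, 1, 2))

# different_sources_flag = 1: each sink has its own sources, paired by id
meta_paired <- data.frame(
  row.names  = c("soil1", "sinkA", "water1", "sinkB"),
  Env        = c("soil", "gutA", "water", "gutB"),
  SourceSink = c("Source", "Sink", "Source", "Sink"),
  id         = c(1, 1, 2, 2))
```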
IV. Script annotation and walkthrough
1. The few important arguments are all set at the top of the script.
2. Required files: meta_data (the grouping file; see the flow charts above for its format) and otu_data (the sample OTU abundance table). Just set the matching file names in the script.
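One detail worth spelling out: the script transposes the OTU table right after reading it, so in all_otu.csv the rows are OTUs and the columns are samples, whose names must match the row names of meta_data. A toy sketch (sample and OTU names are hypothetical):

```r
# Rows = OTUs, columns = samples
f <- file.path(tempdir(), "all_otu.csv")   # stands in for Data_files/all_otu.csv
otu <- data.frame(soil1  = c(10, 0, 5),
                  water1 = c(2, 8, 0),
                  sinkA  = c(6, 3, 1),
                  row.names = c("OTU_1", "OTU_2", "OTU_3"))
write.csv(otu, f, quote = FALSE)

# The same two lines as in the script: after t(), rows are samples
otus <- read.csv(f, header = TRUE, sep = ",", row.names = 1)
otus <- t(as.matrix(otus))
rownames(otus)   # "soil1" "water1" "sinkA"
```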
3. Taking the most complex case, the Multiple_sinks script, as the example:
# Load the required packages
library("vegan")
library("dplyr")
library("doParallel")
library("foreach")
library("mgcv")
library("reshape2")
library("ggplot2")
library("cowplot")
library("Rcpp")
library("RcppArmadillo")
# Clean up the environment before starting
rm(list = ls())
gc()
# All variables you need to set: 3 in total
# Set the arguments for your data; keep everything in csv format
metadata_file = "all_meta.csv" # grouping information
count_matrix = "all_otu.csv" # OTU table
EM_iterations = 1000 # default value = 1000
## if you use different sources for each sink, different_sources_flag = 1, otherwise = 0
different_sources_flag = 0
# Load the main program
print("Change directory path")
dir_path = paste("C:/ ...your path/ FEAST/") # change to the directory containing the FEAST folder
setwd(paste0(dir_path, "FEAST_src"))
source("src.R")
# Load sample metadata (also csv format)
setwd(paste0(dir_path, "Data_files"))
metadata <- read.csv(metadata_file, header=T, sep = ",", row.names = 1)
# Load OTU table
otus <- read.csv(count_matrix, header=T, sep = ",", row.names = 1)
otus <- t(as.matrix(otus))
# Computation section: no edits needed below this point
# Extract only those samples in common between the two tables
common.sample.ids <- intersect(rownames(metadata), rownames(otus))
otus <- otus[common.sample.ids,]
metadata <- metadata[common.sample.ids,]
# Double-check that the mapping file and otu table
# had overlapping samples
if(length(common.sample.ids) <= 1) {
message <- sprintf('Error: there are %d sample ids in common between the metadata file and data table',
length(common.sample.ids))
stop(message)
}
if(different_sources_flag == 0){
metadata$id[metadata$SourceSink == 'Source'] = NA
metadata$id[metadata$SourceSink == 'Sink'] = c(1:length(which(metadata$SourceSink == 'Sink')))
}
envs <- metadata$Env
Ids <- na.omit(unique(metadata$id))
Proportions_est <- list()
for(it in 1:length(Ids)){
# Extract the source environments and source/sink indices
if(different_sources_flag == 1){
train.ix <- which(metadata$SourceSink=='Source' & metadata$id == Ids[it])
test.ix <- which(metadata$SourceSink=='Sink' & metadata$id == Ids[it])
}
else{
train.ix <- which(metadata$SourceSink=='Source')
test.ix <- which(metadata$SourceSink=='Sink' & metadata$id == Ids[it])
}
num_sources <- length(train.ix)
COVERAGE = min(rowSums(otus[c(train.ix, test.ix),])) #Can be adjusted by the user
str(COVERAGE)
# Define sources and sinks
sources <- as.matrix(rarefy(otus[train.ix,], COVERAGE))
sinks <- as.matrix(rarefy(t(as.matrix(otus[test.ix,])), COVERAGE))
print(paste("Number of OTUs in the sink sample = ",length(which(sinks > 0))))
print(paste("Seq depth in the sources and sink samples = ",COVERAGE))
print(paste("The sink is:", envs[test.ix]))
# Estimate source proportions for each sink
FEAST_output<-FEAST(source=sources, sinks = t(sinks), env = envs[train.ix], em_itr = EM_iterations, COVERAGE = COVERAGE)
Proportions_est[[it]] <- FEAST_output$data_prop[,1]
names(Proportions_est[[it]]) <- c(as.character(envs[train.ix]), "unknown")
if(length(Proportions_est[[it]]) < num_sources +1){
tmp = Proportions_est[[it]]
Proportions_est[[it]][num_sources] = NA
Proportions_est[[it]][num_sources+1] = tmp[num_sources]
}
print("Source mixing proportions")
print(Proportions_est[[it]])
}
print(Proportions_est) # the original version stops here; this raw list is hard to read
# Visualization, adapted from the 文涛 script with several bugs fixed
# Write out the results
FEAST_output = as.data.frame(Proportions_est)
colnames(FEAST_output) = paste("repeat_",Ids,sep = "") # use the Ids as replicate labels
head(FEAST_output)
filename = paste(dir_path,"Result/FEAST.csv",sep = "")
write.csv(FEAST_output,filename,quote = F)
# Quick plots (one pie chart per repeat)
library(RColorBrewer)
library(dplyr)
library(graphics)
head(FEAST_output)
plotname = paste(dir_path,"Result/FEAST_repeat.pdf",sep = "")
pdf(file = plotname,width = 12,height = 12)
par(mfrow=c(ncol(FEAST_output)%/%2 + 1, 2), mar=c(1,1,1,1)) # grid sized by the number of repeats
for (i in 1:length(colnames(FEAST_output))) {
labs <- paste0(row.names(FEAST_output)," (", round(FEAST_output[,i]/sum(FEAST_output[,i])*100,2), "%)")
pie(FEAST_output[,i],labels=labs, init.angle=90,col = brewer.pal(nrow(FEAST_output), "Paired"),# brewer.pal supports 3-12 colors
border="black",main =colnames(FEAST_output)[i] )
}
dev.off()
# Quick plot (one pie chart, averaged over all repeats)
head(FEAST_output)
asx = as.data.frame(rowMeans(FEAST_output))
asx = as.matrix(asx)
asx_norm = t(t(asx)/colSums(asx)) #* 100 # normalize each column to sum to 1 (or 100)
head(asx_norm)
plotname = paste(dir_path,"Result/FEAST_mean.pdf",sep = "")
pdf(file = plotname,width = 6,height = 6)
labs <- paste0(row.names(asx_norm)," (", round(asx_norm[,1]/sum(asx_norm[,1])*100,2), "%)")
pie(asx_norm[,1],labels=labs, init.angle=90,col = brewer.pal(nrow(FEAST_output), "Paired"),# brewer.pal supports 3-12 colors
border="black",main = "mean source proportions")
dev.off()
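Beyond the pie charts, a stacked bar chart can show all repeats in one panel. This is not part of either original script, just a sketch using the already-loaded reshape2/ggplot2 (the toy FEAST_output here mimics the real one written to Result/FEAST.csv):

```r
library(reshape2)
library(ggplot2)

# Toy stand-in with the same shape as FEAST_output (sources x repeats)
FEAST_output <- data.frame(repeat_1 = c(0.5, 0.3, 0.2),
                           repeat_2 = c(0.6, 0.1, 0.3),
                           row.names = c("soil", "water", "unknown"))

# Melt the matrix to long format: one row per (source, repeat) pair
long <- melt(as.matrix(FEAST_output),
             varnames = c("Source", "Repeat"), value.name = "Proportion")

# Stacked bars: one bar per repeat, segments are source proportions
ggplot(long, aes(Repeat, Proportion, fill = Source)) +
  geom_col() +
  theme_bw()
```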
V. Notes from testing
1. Speed: FEAST is reported to be about 300 times faster than SourceTracker. A 20-sample × 2000-OTU table took only about 10 minutes on an i5 with 4 GB RAM, so it is indeed fast.
2. Accuracy: using two similar data sets (20 replicates each; A = source, B = sink; see data_files/compare_otu and compare_meta), the estimated source proportion should in theory approach 100%, but in practice it came out around 70–80%. Whether that counts as accurate is a matter of judgment.
Figure: accuracy comparison